1 Introduction

Hadamard conjugation is an analytic formulation of the relationship between the probabilities
of the expected site patterns of nucleotides for a set of homologous nucleotide sequences and
the parameters of some simple models of sequence evolution on a proposed phylogeny $T$.
An important application of these relations is as a theoretical tool for analysing properties
of phylogenetic inference methods, such as maximum likelihood and maximum parsimony.
They are also a tool for generating simulated data and for determining phylogenetic
invariants. Hadamard conjugation can also be used directly for phylogenetic inference,
inferring either trees with the Closest Tree algorithm [7] or networks using Spectronet [?].

Hadamard conjugation was first introduced in 1989 [?] to analyse two-state character
sequences evolving under the Neyman model [15]. Evans and Speed in 1993 [3] noted that
Kimura's three substitution types (K3ST) model [?] for 4-state characters could be modelled
by the Klein group $Z_2 \times Z_2$. Noting this, Szekely et al. [21, 22] extended the two-state
analysis to a more general algebraic theory, where substitutions belonged to an arbitrary
Abelian group. They then applied this to sequences evolving under the K3ST model. Current
applications of Closest Tree and Spectronet [?] usually use the 4-state K3ST model or its
derivatives, the K2ST and Jukes-Cantor models.

A pathset in a phylogenetic tree $T$ is a generalisation of the concept of a path. This
approach allows the concept of pairwise distances between sequences to be extended to
distances connecting larger sets of taxa. It provides properties that can be related to other
models, such as the molecular clock hypothesis. This has, for example, proved pivotal in
allowing a simpler analytic expression of the likelihood function, as developed in [1],
leading to an algebraic solution for the maximum likelihood points. It has also proved useful
in identifying phylogenetic invariants [?], and in the introduction of projected spectra [23],
which reduce both the variance in the parameter estimates and the computational complexity
of the Closest Tree algorithm [7]. Each of the above examples relies on some identities
between the phylogenetic tree and the probabilities of obtaining sequences evolved under
that tree. However, these identities were never directly proved. Here we provide, for the
first time, a direct proof of these identities. Effectively, this is an alternative proof of
Hadamard conjugation for the K3ST model, where practical interpretations of the intermediate
terms are developed, showing directly the relationships between the topology of $T$ and the
substitution probabilities across its edges. This is an important contribution that can serve
in the burgeoning area of algebraic statistics in biology, and in phylogenetics in particular
(see e.g. [?]).

We model the relationship of the differences of $n$ sequences labeled $1, 2, \ldots, n$ from a
reference sequence labeled 0. Because the models are reversible, the choice of reference
sequence is arbitrary. The topology of $T$ and the model parameters are presented in a sparse
matrix $Q_T$ of $2^n$ rows and columns, called the edge-length spectrum. The probabilities of
each site pattern are presented in a similar sized matrix $P_T$, called the sequence
probability spectrum.

We also define a Hadamard matrix $H_n$ of $2^n$ rows and columns, and show that the matrix
products

\[ H_n Q_T H_n, \qquad H_n P_T H_n, \]

both relate to properties of path-sets. We prove the major result by interpreting
corresponding components of each entry of these matrices.

In earlier representations [?] the Hadamard conjugations for K3ST were presented as
conjugations of vectors of $4^n$ components by the Hadamard matrices $H_{2n}$ of $4^n$ rows and
columns. In the formulation presented here the vectors are replaced by matrices of $2^n$ rows
and columns, which are pre- and post-multiplied by $H_n$, a Hadamard matrix of the same order.



2 Kimura's 3ST model 



Kimura's [?] three substitution types model (K3ST) specified independent rates, $\alpha$,
$\beta$ and $\gamma$, for each of three substitution types between the RNA or DNA nucleotides.
Here we will refer to these substitutions as:

$t_\alpha$: the substitutions $A \leftrightarrow G$, $U(T) \leftrightarrow C$ (transitions);

$t_\beta$: the substitutions $A \leftrightarrow U(T)$, $G \leftrightarrow C$ (transversions type $\beta$);

$t_\gamma$: the substitutions $A \leftrightarrow C$, $U(T) \leftrightarrow G$ (transversions type $\gamma$).

By including the identity $t_\epsilon$, we find the set of substitutions

\[ \mathcal{T} = \{ t_\epsilon, t_\alpha, t_\beta, t_\gamma \} \]

is a group under composition, which acts on the nucleotide set $\{A, C, G, T(U)\}$.

Observation 1 $(\mathcal{T}, \circ)$ is isomorphic to the Klein 4-group, $(Z_2 \times Z_2, +_2)$.

□
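Observation 1 can be checked mechanically. The following is a minimal sketch, not part of the original derivation: the dictionary `SUBSTITUTIONS` and the helper `compose` are illustrative names, and the letters `e`, `a`, `b`, `g` stand for $\epsilon$, $\alpha$, $\beta$, $\gamma$. It writes each substitution type as a permutation of the nucleotides and builds the composition table.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # T stands for U in RNA

# Each substitution type written as a permutation of the nucleotides.
SUBSTITUTIONS = {
    "e": {"A": "A", "C": "C", "G": "G", "T": "T"},  # identity
    "a": {"A": "G", "G": "A", "T": "C", "C": "T"},  # transitions
    "b": {"A": "T", "T": "A", "G": "C", "C": "G"},  # transversions type beta
    "g": {"A": "C", "C": "A", "T": "G", "G": "T"},  # transversions type gamma
}


def compose(s, t):
    """Return the name of the substitution 'apply t, then apply s'."""
    composed = {x: SUBSTITUTIONS[s][SUBSTITUTIONS[t][x]] for x in NUCLEOTIDES}
    for name, perm in SUBSTITUTIONS.items():
        if perm == composed:
            return name
    raise ValueError("the set is not closed under composition")


# Full composition (Cayley) table of the four substitution types.
cayley = {(s, t): compose(s, t) for s, t in product(SUBSTITUTIONS, repeat=2)}
```

Every element is its own inverse and the table is symmetric, which characterises the Klein 4-group among groups of order four.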



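Kimura modelled the expected differences between two sequences separated by time $t$. With
the three specified rates, the expected numbers of substitutions of each type are therefore

\[ q(\alpha) = \alpha t, \qquad q(\beta) = \beta t, \qquad q(\gamma) = \gamma t. \]

The number of substitutions of each type observed between homologous nucleotides of the
two sequences can be used to estimate the probabilities $p(\alpha)$, $p(\beta)$ and $p(\gamma)$ of
each type occurring. By setting $\beta = \gamma$, or $\alpha = \beta = \gamma$, this model projects to
Kimura's better known two substitution type model [?], or to the simple Jukes/Cantor model [12].

Kimura derived expressions for the expected numbers as functions of the probabilities. These
are equivalent to the standard expression of the rate matrix $R$ derived from the stochastic
matrix $M$, over time $t$,

\[ M = \exp(Rt), \tag{1} \]

where, with $K = q(\alpha) + q(\beta) + q(\gamma)$ being the total number of substitutions,

\[
M = \begin{pmatrix}
p(\epsilon) & p(\alpha) & p(\beta) & p(\gamma) \\
p(\alpha) & p(\epsilon) & p(\gamma) & p(\beta) \\
p(\beta) & p(\gamma) & p(\epsilon) & p(\alpha) \\
p(\gamma) & p(\beta) & p(\alpha) & p(\epsilon)
\end{pmatrix},
\qquad
Rt = \begin{pmatrix}
-K & q(\alpha) & q(\beta) & q(\gamma) \\
q(\alpha) & -K & q(\gamma) & q(\beta) \\
q(\beta) & q(\gamma) & -K & q(\alpha) \\
q(\gamma) & q(\beta) & q(\alpha) & -K
\end{pmatrix}.
\]

Let $H_2$ be the $4 \times 4$ Hadamard matrix

\[
H_2 = \begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1
\end{pmatrix}.
\]

Observation 2 $H_2$ diagonalises both $M$ and $Rt$. In particular

\[
H_2^{-1} M H_2 = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 - 2p(\alpha) - 2p(\gamma) & 0 & 0 \\
0 & 0 & 1 - 2p(\beta) - 2p(\gamma) & 0 \\
0 & 0 & 0 & 1 - 2p(\alpha) - 2p(\beta)
\end{pmatrix}
\]

and

\[
H_2^{-1} Rt H_2 = -2 \begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & q(\alpha) + q(\gamma) & 0 & 0 \\
0 & 0 & q(\beta) + q(\gamma) & 0 \\
0 & 0 & 0 & q(\alpha) + q(\beta)
\end{pmatrix}.
\]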

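Hence from equation (1) we find

\begin{align*}
1 - 2(p(\alpha) + p(\gamma)) &= p(\epsilon) - p(\alpha) + p(\beta) - p(\gamma) = e^{-2(q(\alpha) + q(\gamma))} = e^{-K - q(\alpha) + q(\beta) - q(\gamma)}, \\
1 - 2(p(\beta) + p(\gamma)) &= p(\epsilon) + p(\alpha) - p(\beta) - p(\gamma) = e^{-2(q(\beta) + q(\gamma))} = e^{-K + q(\alpha) - q(\beta) - q(\gamma)}, \\
1 - 2(p(\alpha) + p(\beta)) &= p(\epsilon) - p(\alpha) - p(\beta) + p(\gamma) = e^{-2(q(\alpha) + q(\beta))} = e^{-K - q(\alpha) - q(\beta) + q(\gamma)},
\end{align*}

which can be succinctly expressed as

\[ H_1 P H_1 = \mathrm{Exp}(H_1 Q H_1), \tag{2} \]

where

\[
H_1 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
P = \begin{pmatrix} p(\epsilon) & p(\alpha) \\ p(\beta) & p(\gamma) \end{pmatrix}, \qquad
Q = \begin{pmatrix} -K & q(\alpha) \\ q(\beta) & q(\gamma) \end{pmatrix},
\]

and Exp is the exponential function applied to each entry of the matrix. Equation (2) can be
inverted (provided the arguments of Ln are all positive) to give

\[ H_1 Q H_1 = \mathrm{Ln}(H_1 P H_1), \tag{3} \]

where Ln is the natural logarithm applied to each component of the matrix.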

The invertibility of equations (2) and (3) means that, provided the parameters are in the valid
ranges, the model could be specified by the three probabilities $p(\alpha)$, $p(\beta)$ and
$p(\gamma)$, or by the three parameters $q(\alpha)$, $q(\beta)$ and $q(\gamma)$. Indeed, when we do
this, we do not need to rely on a rate/time specification and a Poisson process of substitution.
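The relations (1) to (3) can be checked numerically. The sketch below is illustrative only: the parameter values are arbitrary, the function names are hypothetical, and the matrix exponential is approximated by a truncated Taylor series rather than a library routine. It compares the first row of $\exp(Rt)$ with the probabilities obtained from equation (2), then recovers the edge parameters with equation (3).

```python
import numpy as np

H1 = np.array([[1.0, 1.0], [1.0, -1.0]])
H1_INV = H1 / 2.0  # since H1 @ H1 = 2 I


def rate_matrix(qa, qb, qg):
    """Rt for K3ST with rows and columns ordered (e, alpha, beta, gamma)."""
    K = qa + qb + qg
    return np.array([
        [-K, qa, qb, qg],
        [qa, -K, qg, qb],
        [qb, qg, -K, qa],
        [qg, qb, qa, -K],
    ])


def expm_taylor(X, terms=60):
    """Matrix exponential by a truncated Taylor series (fine for small entries)."""
    result = np.eye(X.shape[0])
    term = np.eye(X.shape[0])
    for k in range(1, terms):
        term = term @ X / k
        result = result + term
    return result


def probabilities_from_lengths(qa, qb, qg):
    """Solve equation (2) for P = [[p(e), p(alpha)], [p(beta), p(gamma)]]."""
    Q = np.array([[-(qa + qb + qg), qa], [qb, qg]])
    return H1_INV @ np.exp(H1 @ Q @ H1) @ H1_INV


def lengths_from_probabilities(P):
    """Solve equation (3) for Q = [[-K, q(alpha)], [q(beta), q(gamma)]]."""
    return H1_INV @ np.log(H1 @ P @ H1) @ H1_INV


qa, qb, qg = 0.10, 0.05, 0.02  # arbitrary illustrative values
M = expm_taylor(rate_matrix(qa, qb, qg))
P = probabilities_from_lengths(qa, qb, qg)
Q_back = lengths_from_probabilities(P)
```

Here `M[0]` and `P.ravel()` both list $p(\epsilon), p(\alpha), p(\beta), p(\gamma)$, so agreement between them confirms that the entrywise form (2) reproduces the matrix exponential (1) for these values.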






3 Substitutions across the edges of a tree 



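Let $T$ be a tree (phylogeny) with leaf set $L(T) = \{0, 1, \ldots, n\}$, and edge set $E(T)$. We
can postulate three independent Kimura probability parameters $p_e(\alpha)$, $p_e(\beta)$ and
$p_e(\gamma)$ for each edge $e \in E(T)$ and a transition matrix

\[
M_e = \begin{pmatrix}
p_e(\epsilon) & p_e(\alpha) & p_e(\beta) & p_e(\gamma) \\
p_e(\alpha) & p_e(\epsilon) & p_e(\gamma) & p_e(\beta) \\
p_e(\beta) & p_e(\gamma) & p_e(\epsilon) & p_e(\alpha) \\
p_e(\gamma) & p_e(\beta) & p_e(\alpha) & p_e(\epsilon)
\end{pmatrix}.
\]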

Suppose we assign nucleotides to each vertex of $T$ according to a model parameterised by
these probabilities for each edge $e \in E(T)$. The matrices $M_e$ for each $e \in E(T)$ are all
diagonalised by $H_2$, and hence commute, so for any subset $W \subseteq E(T)$ of edges we can
define

\[ M_W = \prod_{e \in W} M_e, \tag{4} \]

the transition matrix representing the probabilities of change concatenated across the edges
in $W$.

We observe that

\[
H_2^{-1} M_W H_2 = H_2^{-1} \Big( \prod_{e \in W} M_e \Big) H_2 = \prod_{e \in W} H_2^{-1} M_e H_2
\]

is a diagonal matrix whose entries are the products of the corresponding eigenvalues of the
factor matrices $M_e$.

We can define corresponding $2 \times 2$ matrices

\[
P_W = \begin{pmatrix} p_W(\epsilon) & p_W(\alpha) \\ p_W(\beta) & p_W(\gamma) \end{pmatrix}, \qquad
Q_W = \begin{pmatrix} -K_W & q_W(\alpha) \\ q_W(\beta) & q_W(\gamma) \end{pmatrix},
\]

writing $P_e$ for $P_{\{e\}}$, etc. Hence, from equation (3), the entries $q_W(\alpha)$, $q_W(\beta)$
and $q_W(\gamma)$ of $Q_W$ are linear functions of the logarithms of these eigenvalues, and we find

Observation 3

\[ Q_W = \sum_{e \in W} Q_e. \]

□
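The additivity in Observation 3 can be illustrated for two edges. This is a sketch under stated assumptions: the parameter values are arbitrary and the helper names (`q_matrix`, `p_from_q`, `q_from_p`, `transition_matrix`) are hypothetical. It multiplies the two $4 \times 4$ transition matrices as in equation (4), reads $P_W$ off the first row of the product, and recovers $Q_W$ with equation (3).

```python
import numpy as np

H1 = np.array([[1.0, 1.0], [1.0, -1.0]])
H1_INV = H1 / 2.0


def q_matrix(qa, qb, qg):
    """Q = [[-K, q(alpha)], [q(beta), q(gamma)]] for one edge."""
    return np.array([[-(qa + qb + qg), qa], [qb, qg]])


def p_from_q(Q):
    """Equation (2) solved for P."""
    return H1_INV @ np.exp(H1 @ Q @ H1) @ H1_INV


def q_from_p(P):
    """Equation (3) solved for Q."""
    return H1_INV @ np.log(H1 @ P @ H1) @ H1_INV


def transition_matrix(P):
    """4x4 matrix M_e built from P = [[p(e), p(alpha)], [p(beta), p(gamma)]]."""
    pe, pa, pb, pg = P[0, 0], P[0, 1], P[1, 0], P[1, 1]
    return np.array([
        [pe, pa, pb, pg],
        [pa, pe, pg, pb],
        [pb, pg, pe, pa],
        [pg, pb, pa, pe],
    ])


Q1 = q_matrix(0.10, 0.05, 0.02)  # arbitrary illustrative edge
Q2 = q_matrix(0.03, 0.07, 0.01)  # arbitrary illustrative edge

# Equation (4) for W = {e1, e2}.
M_W = transition_matrix(p_from_q(Q1)) @ transition_matrix(p_from_q(Q2))

# The first row of M_W lists p_W(e), p_W(alpha), p_W(beta), p_W(gamma).
P_W = np.array([[M_W[0, 0], M_W[0, 1]], [M_W[0, 2], M_W[0, 3]]])
Q_W = q_from_p(P_W)
```

For these values `Q_W` agrees with `Q1 + Q2` up to rounding, as the observation states.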



Because of this linearity, we define $q_e(\alpha)$, $q_e(\beta)$ and $q_e(\gamma)$ to be the three
edge-length parameters for each edge $e$, and can specify our model by the $3|E(T)|$
independent parameters

\[ q_e(\theta): \quad \theta \in \{\alpha, \beta, \gamma\}; \; e \in E(T). \]

The deletion of an edge $e \in E(T)$ induces two subtrees, whose leaf label sets partition
$[n]_0 = \{0, 1, \ldots, n\}$ into two subsets. We choose the subset $A \subseteq [n] = \{1, \ldots, n\}$
(the subset not containing 0) to index $e$ as $e_A$. We incorporate the edge-length parameters
into three vectors $q_\alpha$, $q_\beta$ and $q_\gamma$ indexed by the $2^n$ subsets of $[n]$, where
for $A \subseteq [n]$

\[
(q_\alpha)_A = \begin{cases}
q_{e_A}(\alpha) & \text{if } e_A \in E(T), \\
-\sum_{e_B \in E(T)} q_{e_B}(\alpha) & \text{if } A = \emptyset, \\
0 & \text{else,}
\end{cases}
\]

with similar structures for $q_\beta$ and $q_\gamma$. The entries in these vectors are ordered by
the subsets of $[n]$ listed lexicographically:
$\emptyset, \{1\}, \{2\}, \{1,2\}, \{3\}, \{1,3\}, \{2,3\}, \{1,2,3\}, \{4\}, \ldots$, etc.

We will also find it convenient to gather these three vectors into a $2^n \times 2^n$ matrix

\[ Q_T = [Q_{A,B}]_{A,B \subseteq [n]}, \]

where

\[
Q_{A,B} = \begin{cases}
q_{e_A}(\alpha) & \text{if } e_A \in E(T),\ B = \emptyset, \\
q_{e_B}(\beta) & \text{if } A = \emptyset,\ e_B \in E(T), \\
q_{e_A}(\gamma) & \text{if } A = B,\ e_A \in E(T), \\
-K_T & \text{if } A = B = \emptyset, \\
0 & \text{else,}
\end{cases}
\]

and

\[ K_T = \sum_{e \in E(T)} \big( q_e(\alpha) + q_e(\beta) + q_e(\gamma) \big) = \sum_{e \in E(T)} K_e. \]

Thus the leading column of $Q_T$ is $q_\alpha$, the leading row is $q_\beta$, and the main diagonal
is $q_\gamma$; all other entries are 0, apart from the leading entry, which is $-K_T$ (hence the
sum of all entries of $Q_T$ is 0). $Q_T$ is referred to as the edge length spectrum for $T$. The
positive entries of this spectrum identify the edges of $T$.
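The case definition of $Q_{A,B}$ translates directly into code if each subset of $[n]$ is encoded as a bitmask, since counting in binary reproduces the lexicographic order $\emptyset, \{1\}, \{2\}, \{1,2\}, \{3\}, \ldots$. The sketch below is an illustration only: the edge lengths are made-up values for a four-taxon tree with edges $e_1, e_2, e_3, e_{13}, e_{123}$, and `edge_length_spectrum` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical edge-length parameters for a tree on taxa {0, 1, 2, 3}.
# Keys are edges e_A, encoded as bitmasks of A (bit i-1 is set iff leaf i is in A);
# values are (q(alpha), q(beta), q(gamma)).
EDGES = {
    0b001: (0.10, 0.04, 0.02),  # e_1
    0b010: (0.08, 0.03, 0.01),  # e_2
    0b100: (0.12, 0.05, 0.02),  # e_3
    0b101: (0.03, 0.01, 0.01),  # e_13
    0b111: (0.09, 0.02, 0.03),  # e_123
}


def edge_length_spectrum(edges, n):
    """Build Q_T following the case definition of Q_{A,B}."""
    Q = np.zeros((2**n, 2**n))
    for a, (qa, qb, qg) in edges.items():
        Q[a, 0] = qa  # B empty: leading column holds q_alpha
        Q[0, a] = qb  # A empty: leading row holds q_beta
        Q[a, a] = qg  # A = B: main diagonal holds q_gamma
    Q[0, 0] = -sum(sum(q) for q in edges.values())  # leading entry -K_T
    return Q


Q_T = edge_length_spectrum(EDGES, n=3)
```

With five edges the matrix has sixteen nonzero entries (three per edge plus the leading entry), and its entries sum to zero.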

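If we propose a sequence of nucleotides at leaf 0, then we can generate homologous sequences
at each of the other leaves under this model. A common position in each of these sequences
is called a site. If, in an instance of such sequences, the character states at leaves
$0, 1, \ldots, n$ are $\chi(0), \chi(1), \ldots, \chi(n)$, then this assignment partitions $[n]$ into
the subsets

\[ S_\theta = \{ i \in [n] : t_\theta(\chi(0)) = \chi(i) \}, \quad \text{for } \theta \in \{\epsilon, \alpha, \beta, \gamma\}. \]

Thus for example $S_\epsilon$ is the set of leaves of $[n]$ with the same state as at 0. We index a
site pattern by $(A, B)$, the pair of subsets of $[n]$, where

\[ A = S_\alpha \cup S_\gamma, \qquad B = S_\beta \cup S_\gamma, \]

noting that the partition can be recovered from $(A, B)$. In particular

\[ S_\gamma = A \cap B, \quad S_\alpha = A - S_\gamma, \quad S_\beta = B - S_\gamma, \quad S_\epsilon = [n] - (A \cup B). \]

We will show that the probability $p_{A,B}$ of obtaining the site pattern $(A, B)$, for each
$A, B \subseteq [n]$, is a function of the edge length parameters.

We now define another $2^n \times 2^n$ matrix $P_T$, the sequence probability spectrum, with rows
and columns indexed by the subsets of $[n]$, where

\[ P_T = [p_{A,B}]_{A,B \subseteq [n]}, \]

and $p_{A,B}$ is the probability of obtaining the site pattern $(A, B)$.

\[
q_\alpha = \begin{pmatrix} -K(\alpha) \\ q_1(\alpha) \\ q_2(\alpha) \\ 0 \\ q_3(\alpha) \\ q_{13}(\alpha) \\ 0 \\ q_{123}(\alpha) \end{pmatrix}, \quad
q_\beta = \begin{pmatrix} -K(\beta) \\ q_1(\beta) \\ q_2(\beta) \\ 0 \\ q_3(\beta) \\ q_{13}(\beta) \\ 0 \\ q_{123}(\beta) \end{pmatrix}, \quad
q_\gamma = \begin{pmatrix} -K(\gamma) \\ q_1(\gamma) \\ q_2(\gamma) \\ 0 \\ q_3(\gamma) \\ q_{13}(\gamma) \\ 0 \\ q_{123}(\gamma) \end{pmatrix},
\]

\[
Q = \begin{pmatrix}
-K & q_1(\alpha) & q_2(\alpha) & 0 & q_3(\alpha) & q_{13}(\alpha) & 0 & q_{123}(\alpha) \\
q_1(\beta) & q_1(\gamma) & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
q_2(\beta) & \cdot & q_2(\gamma) & \cdot & \cdot & \cdot & \cdot & \cdot \\
0 & \cdot & \cdot & 0 & \cdot & \cdot & \cdot & \cdot \\
q_3(\beta) & \cdot & \cdot & \cdot & q_3(\gamma) & \cdot & \cdot & \cdot \\
q_{13}(\beta) & \cdot & \cdot & \cdot & \cdot & q_{13}(\gamma) & \cdot & \cdot \\
0 & \cdot & \cdot & \cdot & \cdot & \cdot & 0 & \cdot \\
q_{123}(\beta) & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & q_{123}(\gamma)
\end{pmatrix}
\]

Figure 1: Example edge length spectra for the tree $T_{13}$ on $n + 1 = 4$ taxa illustrated in
figure ??. Corresponding components of the vectors $q_\alpha$, $q_\beta$, $q_\gamma$ give the three
edge length parameters for the corresponding edge. The value "0" indicates that there is no
corresponding edge in $T$. These vectors are placed in the leading row, column and main
diagonal of the matrix $Q$. This means that for $A, B \subseteq \{1, 2, 3\}$,
$Q_{\emptyset,B} = q_B(\alpha)$, $Q_{A,\emptyset} = q_A(\beta)$, $Q_{A,A} = q_A(\gamma)$, and for all other
entries $Q_{A,B} = 0$, except the first entry $Q_{\emptyset,\emptyset} = -K$, where
$K = K(\alpha) + K(\beta) + K(\gamma)$. The entries indicated by "$\cdot$" are all zero; these are zero
for every tree. The entries indicated by "0" are zero for this tree $T$, but for different trees
can be non-zero.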
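The indexing of a site pattern by $(A, B)$ can be made concrete with a two-bit encoding of the nucleotides. In the sketch below the encoding `CODE` is an assumption chosen so that XOR of two codes gives the substitution type between them (1 for $\alpha$, 2 for $\beta$, 3 for $\gamma$); the function names are hypothetical, and subsets of $[n]$ are again bitmasks.

```python
# Two-bit encoding chosen so that code(x) XOR code(y) is the substitution type
# taking x to y: 0 = identity, 1 = alpha, 2 = beta, 3 = gamma.
CODE = {"A": 0, "G": 1, "T": 2, "U": 2, "C": 3}


def site_pattern(states):
    """Return (A, B) as bitmasks; states[0] is the nucleotide at leaf 0."""
    ref = CODE[states[0]]
    a = b = 0
    for i, s in enumerate(states[1:], start=1):
        diff = CODE[s] ^ ref
        if diff & 1:          # alpha or gamma: leaf i belongs to A
            a |= 1 << (i - 1)
        if diff & 2:          # beta or gamma: leaf i belongs to B
            b |= 1 << (i - 1)
    return a, b


def partition(a, b, n):
    """Recover (S_e, S_alpha, S_beta, S_gamma) as bitmasks from (A, B)."""
    s_gamma = a & b
    s_alpha = a & ~s_gamma
    s_beta = b & ~s_gamma
    s_e = ((1 << n) - 1) & ~(a | b)
    return s_e, s_alpha, s_beta, s_gamma
```

For example, the states `"AGTC"` at leaves 0 to 3 differ from leaf 0 by $\alpha$, $\beta$ and $\gamma$ respectively, giving $A = \{1, 3\}$ and $B = \{2, 3\}$.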



4 Hadamard matrices and Path-sets 



We define recursively the family $\{H_n : n \in \mathbb{Z}^+\}$ (known as Sylvester matrices),
where for $n \geq 2$

\[
H_n = H_1 \otimes H_{n-1} = \begin{pmatrix} H_{n-1} & H_{n-1} \\ H_{n-1} & -H_{n-1} \end{pmatrix}
\]

is a symmetric Hadamard matrix of order $2^n$, with $H_1$ and $H_2$ as previously defined. It is
easily seen that $H_n^{-1} = 2^{-n} H_n$.

It is known [?] that if we index the rows and columns of $H_n$ lexicographically by the subsets
of $[n]$, then:

Observation 4

\[ [H_n]_{A,B} = h(A, B) = (-1)^{|A \cap B|}. \]

□
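The recursive definition and Observation 4 can be compared directly. This is a small sketch with hypothetical helper names, assuming subsets of $[n]$ are encoded as bitmasks (bit $i-1$ set when $i$ is in the set) so that integer order matches the lexicographic order used above.

```python
import numpy as np


def sylvester(n):
    """Return H_n, the Sylvester Hadamard matrix of order 2**n."""
    H1 = np.array([[1, 1], [1, -1]])
    H = np.array([[1]])
    for _ in range(n):
        H = np.kron(H1, H)  # block form [[H, H], [H, -H]]
    return H


def h(a, b):
    """(-1)**|A intersect B| for subsets A, B encoded as bitmasks a, b."""
    return -1 if bin(a & b).count("1") % 2 else 1


n = 3
H = sylvester(n)
by_formula = np.array([[h(a, b) for b in range(2**n)] for a in range(2**n)])
```

The product `H @ H` equals $2^n$ times the identity, which is the statement $H_n^{-1} = 2^{-n} H_n$.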



Let $\Pi_{i,j}$ be the set of edges in the path in $T$ connecting leaves $i$ and $j$
($i, j \in \{0, 1, \ldots, n\}$). The entries of the transition matrix $M_{\Pi_{i,j}}$ represent the
probabilities of observing the corresponding differences between the nucleotides at leaves
$i$ and $j$. We see further that

\[ M_{\Pi_{i,j}} = \prod_{e_A \in \Pi_{i,j}} M_{e_A}. \]

Because each edge $e_A$ in $\Pi_{i,j}$ separates vertex $i$ from $j$, these edges are precisely
those for which $A \cap \{i, j\}$ contains one, but not both, elements. Hence we see

\[ \Pi_{i,j} = \{ e_A \in E(T) : h(A, \{i, j\}) = -1 \}. \]

We generalise this, for any $C \subseteq [n]$, finding it useful to consider the collection of edges

\[ \Pi_C = \{ e_A \in E(T) : h(A, C) = -1 \}. \]

Observation 5 In [19] it is shown that:

for $|C| \equiv 0 \pmod 2$, $\Pi_C$ is a set of $|C|/2$ edge-disjoint paths, whose endpoints are the
leaves in $C$;

for $|C| \equiv 1 \pmod 2$, $\Pi_C$ is a set of $(|C| + 1)/2$ edge-disjoint paths, whose endpoints are
the leaves in $C \cup \{0\}$.

□

$\Pi_C$ is called a path-set. In particular $\Pi_{\{i\}} = \Pi_{0,i}$ and $\Pi_{\{i,j\}} = \Pi_{i,j}$
comprise single paths, and $\Pi_\emptyset = \emptyset$. We find the set of pathsets is a group (under
symmetric difference) isomorphic to $Z_2^n$.

The sum of edge lengths on a path connecting two leaves can naturally be thought of as the
distance between the leaves. We extend this distance concept, for each substitution type
$\theta \in \{\alpha, \beta, \gamma\}$, to sets of paths, to define the path-set distance

\[ d_{\Pi_C}(\theta) = \sum_{e_A \in \Pi_C} q_{e_A}(\theta), \]

so that

\begin{align*}
\sum_{A \subseteq [n]} h(A, C)\, q_A(\theta) &= q_\emptyset(\theta) + \sum_{e_A \in E(T)} h(A, C)\, q_A(\theta) \\
&= \sum_{e_A \in E(T)} \big( h(A, C) - 1 \big)\, q_A(\theta) \\
&= -2 \sum_{h(A, C) = -1} q_A(\theta) \\
&= -2\, d_{\Pi_C}(\theta). \tag{5}
\end{align*}
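Path-sets and equation (5) are easy to explore on a small tree. The sketch below is illustrative: it assumes a four-taxon tree with edges $e_1, e_2, e_3, e_{13}, e_{123}$, uses made-up lengths for a single substitution type $\theta$, and encodes subsets as bitmasks; the function names are hypothetical.

```python
def odd_overlap(a, c):
    """True when |A intersect C| is odd, i.e. h(A, C) = -1 (bitmask encoding)."""
    return bin(a & c).count("1") % 2 == 1


# Hypothetical lengths q_{e_A}(theta) for one substitution type, keyed by bitmask of A.
EDGE_LENGTHS = {0b001: 0.10, 0b010: 0.20, 0b100: 0.15, 0b101: 0.05, 0b111: 0.30}


def path_set(c, edges=EDGE_LENGTHS):
    """Pi_C: the edges e_A with h(A, C) = -1."""
    return {a for a in edges if odd_overlap(a, c)}


def path_set_distance(c, edges=EDGE_LENGTHS):
    """d_{Pi_C}(theta): total length of the edges in Pi_C."""
    return sum(edges[a] for a in path_set(c, edges))


def hadamard_sum(c, edges=EDGE_LENGTHS, n=3):
    """Left-hand side of equation (5): sum over A of h(A, C) q_A(theta)."""
    q = [0.0] * 2**n
    for a, length in edges.items():
        q[a] = length
    q[0] = -sum(edges.values())  # entry indexed by the empty set
    return sum((-1 if odd_overlap(a, c) else 1) * q[a] for a in range(2**n))
```

For instance `path_set(0b001)` returns the edges $e_1, e_{13}, e_{123}$ on the path from leaf 1 to leaf 0, and `path_set(0b101)` returns $e_1, e_3$, the path between leaves 1 and 3. The symmetric difference of $\Pi_{\{1\}}$ and $\Pi_{\{2\}}$ is $\Pi_{\{1,2\}}$, in line with the group structure noted above.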

Suppose each vertex $v$ of $T$ is assigned a character state $\chi(v)$; then for each edge
$e = (u, v) \in E(T)$ there is a transformation $t_{\theta_e}$ such that $t_{\theta_e}(\chi(u)) = \chi(v)$.
We can write $t_{\theta_e} = (\chi(u))^{-1} \chi(v) = \chi(u)\chi(v)$, as the transformations are
Boolean. For the path $\Pi_{i,j}$ connecting leaves $i$ and $j$ we find

\[ \prod_{e \in \Pi_{i,j}} t_{\theta_e} = \prod_{e = (u,v) \in \Pi_{i,j}} \chi(u)\chi(v) = \chi(i)\chi(j), \]

as the products at each internal vertex cancel. Further, for any $C \subseteq [n]$ let
$C_0 = C \cup \{0\}$ if $|C|$ is odd, $C_0 = C$ otherwise; then

\[ \prod_{e \in \Pi_C} t_{\theta_e} = \prod_{i \in C_0} \chi(i). \tag{6} \]

Suppose $\prod_{e \in \Pi_C} t_{\theta_e} \in \{\epsilon, \alpha\}$; then the number of factors
$\chi(i) \in \{\beta, \gamma\}$ in equation (6) must be even, hence with $B = S_\beta \cup S_\gamma$,
$h(B, C_0) = h(B, C) = 1$. Similarly if $\prod_{e \in \Pi_C} t_{\theta_e} \in \{\beta, \gamma\}$, then
the number of factors $\chi(i) \in \{\beta, \gamma\}$ in equation (6) must be odd, so
$h(B, C_0) = h(B, C) = -1$. Generalising this we obtain

Observation 6 If each vertex $v$ of $T$ is assigned character state $\chi(v)$, with
$A = S_\alpha \cup S_\gamma$, $B = S_\beta \cup S_\gamma \subseteq [n]$, then for any
$C \subseteq \{0, 1, \ldots, n\}$ with an even number of elements, writing
$\chi(C) = \prod_{i \in C} \chi(i)$, we have

\[ \chi(C) \in \{\epsilon, \alpha\} \iff h(B, C) = 1, \qquad \chi(C) \in \{\epsilon, \beta\} \iff h(A, C) = 1. \tag{7} \]



5 Hadamard Conjugation 

$Q_T$ is the matrix containing the edge weight parameters across $T$. $P_T$ is the matrix of
probabilities of patterns at the leaves of $T$. The link between these is given by the matrix
products $H_n P_T H_n$ and $H_n Q_T H_n$, which both relate to pathset properties. These enable
us to state our major result.

Theorem 7

\[ P_T = H_n^{-1}\, \mathrm{Exp}(H_n Q_T H_n)\, H_n^{-1}, \tag{8} \]

which, provided the arguments of the logarithm are positive, is invertible to give

\[ Q_T = H_n^{-1}\, \mathrm{Ln}(H_n P_T H_n)\, H_n^{-1}. \tag{9} \]

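Proof

The proof of this theorem is based on interpreting the corresponding components, for
$A, B \subseteq [n]$, of

\[ [H_n P_T H_n]_{A,B} \quad \text{and} \quad [H_n Q_T H_n]_{A,B}. \]

As the only nonzero entries in $Q_T$ are $Q_{\emptyset,\emptyset}$ and $Q_{C,\emptyset}$, $Q_{\emptyset,C}$,
$Q_{C,C}$ for $e_C \in E(T)$, we find

\begin{align*}
[H_n Q_T H_n]_{A,B} &= \sum_{A', B' \subseteq [n]} h(A, A')\, h(B, B')\, Q_{A',B'} \\
&= Q_{\emptyset,\emptyset} + \sum_{e_C \in E(T)} \big( h(A, C)\, Q_{C,\emptyset} + h(B, C)\, Q_{\emptyset,C} + h(A, C) h(B, C)\, Q_{C,C} \big) \\
&= \sum_{e_C \in E(T)} \big( (h(A, C) - 1)\, Q_{C,\emptyset} + (h(B, C) - 1)\, Q_{\emptyset,C} + (h(A, C) h(B, C) - 1)\, Q_{C,C} \big) \\
&= \sum_{e_C \in E(T)} \big( (h(A, C) - 1)\, q_{e_C}(\beta) + (h(B, C) - 1)\, q_{e_C}(\alpha) + (h(A, C) h(B, C) - 1)\, q_{e_C}(\gamma) \big) \\
&= -2 \sum_{e_C \in \Pi_A} q_{e_C}(\beta) - 2 \sum_{e_C \in \Pi_B} q_{e_C}(\alpha) - 2 \sum_{e_C \in \Pi_A \Delta \Pi_B} q_{e_C}(\gamma) \\
&= -2 \big( d_{\Pi_A}(\beta) + d_{\Pi_B}(\alpha) + d_{\Pi_A \Delta \Pi_B}(\gamma) \big).
\end{align*}

We can partition $\Pi_A \cup \Pi_B$ into three parts,

\[ U = \Pi_A - \Pi_B, \qquad V = \Pi_B - \Pi_A, \qquad W = \Pi_A \cap \Pi_B, \]

and likewise split the pathset distances into components $d_U(\theta) = \sum_{e \in U} q_e(\theta)$,
etc. Thus

\[ [H_n Q_T H_n]_{A,B} = -2 \big( d_U(\beta) + d_U(\gamma) + d_V(\alpha) + d_V(\gamma) + d_W(\alpha) + d_W(\beta) \big), \]

and

\[ [\mathrm{Exp}(H_n Q_T H_n)]_{A,B} = e^{-2(d_U(\beta) + d_U(\gamma))}\, e^{-2(d_V(\alpha) + d_V(\gamma))}\, e^{-2(d_W(\alpha) + d_W(\beta))}. \tag{10} \]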
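Now, by equation (2), applied to the products of the transition matrices across each of $U$,
$V$ and $W$,

\begin{align*}
e^{-2(d_U(\beta) + d_U(\gamma))} &= p_U(\epsilon) + p_U(\alpha) - p_U(\beta) - p_U(\gamma), \\
e^{-2(d_V(\alpha) + d_V(\gamma))} &= p_V(\epsilon) - p_V(\alpha) + p_V(\beta) - p_V(\gamma), \\
e^{-2(d_W(\alpha) + d_W(\beta))} &= p_W(\epsilon) - p_W(\alpha) - p_W(\beta) + p_W(\gamma).
\end{align*}

Hence equation (10) becomes

\begin{align*}
[\mathrm{Exp}(H_n Q_T H_n)]_{A,B} = {} & \big( p_U(\epsilon) + p_U(\alpha) - p_U(\beta) - p_U(\gamma) \big) \\
& \times \big( p_V(\epsilon) - p_V(\alpha) + p_V(\beta) - p_V(\gamma) \big) \tag{11} \\
& \times \big( p_W(\epsilon) - p_W(\alpha) - p_W(\beta) + p_W(\gamma) \big),
\end{align*}

which, when expanded, comprises the sum of 64 terms of the form

\[ \pm\, p_U(\theta)\, p_V(\phi)\, p_W(\psi), \qquad \theta, \phi, \psi \in \{\epsilon, \alpha, \beta, \gamma\}. \]

Now consider the joint probability $\Pr[\Pi_A : \alpha;\ \Pi_B : \beta]$ that the product of
substitutions across the edges of $\Pi_A$ is $t_\alpha$ and the product across the edges of
$\Pi_B$ is $t_\beta$. This event is attained by the combinations of $t_\theta$ across $U$, $t_\phi$
across $V$ and $t_\psi$ across $W$ such that

\[ t_\theta t_\psi = t_\alpha \quad \text{and} \quad t_\phi t_\psi = t_\beta, \]

which is attained with $t_\theta = t_\alpha t_\psi$ and $t_\phi = t_\beta t_\psi$, for each
$\psi \in \{\epsilon, \alpha, \beta, \gamma\}$, and hence with probability

\[
\Pr[\Pi_A : \alpha;\ \Pi_B : \beta] = p_U(\alpha) p_V(\beta) p_W(\epsilon) + p_U(\epsilon) p_V(\gamma) p_W(\alpha) + p_U(\gamma) p_V(\epsilon) p_W(\beta) + p_U(\beta) p_V(\alpha) p_W(\gamma),
\]

and these terms each occur with a $+$ sign in equation (11).

Now we see the joint probability that the product of substitutions across $\Pi_A$ is either
$\epsilon$ or $\alpha$ and the product across $\Pi_B$ is either $\epsilon$ or $\beta$ is

\[
\Pr[\Pi_A : \epsilon, \alpha;\ \Pi_B : \epsilon, \beta] = \Pr[\Pi_A : \epsilon;\ \Pi_B : \epsilon] + \Pr[\Pi_A : \epsilon;\ \Pi_B : \beta] + \Pr[\Pi_A : \alpha;\ \Pi_B : \epsilon] + \Pr[\Pi_A : \alpha;\ \Pi_B : \beta],
\]

and each summand appears with positive sign in equation (11). Similar examinations of the
terms of equation (11) give

\begin{align*}
[\mathrm{Exp}(H_n Q_T H_n)]_{A,B} = {} & \Pr[\Pi_A : \epsilon, \alpha;\ \Pi_B : \epsilon, \beta] + \Pr[\Pi_A : \beta, \gamma;\ \Pi_B : \alpha, \gamma] \\
& - \Pr[\Pi_A : \epsilon, \alpha;\ \Pi_B : \alpha, \gamma] - \Pr[\Pi_A : \beta, \gamma;\ \Pi_B : \epsilon, \beta]. \tag{12}
\end{align*}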



Let $A_0$ be the set of endpoints of $\Pi_A$ ($A_0 = A$ or $A \cup \{0\}$, whichever is of even
order). Then

\[ \prod_{e = (u,v) \in \Pi_A} \chi(u)\chi(v) = \prod_{i \in A_0} \chi(i), \]

as for each internal vertex $w$, $\chi(w)$ occurs twice in the product, which gives the identity.

Hence $\Pr[\Pi_A : \epsilon, \alpha;\ \Pi_B : \epsilon, \beta]$, the joint probability that the product
of substitutions across $\Pi_A$ is either $\epsilon$ or $\alpha$ and the product across $\Pi_B$ is
either $\epsilon$ or $\beta$, is the joint probability that the product of states across the leaves
of $A_0$ is $\epsilon$ or $\alpha$ and across the leaves of $B_0$ is $\epsilon$ or $\beta$. Thus by
equation (7)

\[ \Pr[\Pi_A : \epsilon, \alpha;\ \Pi_B : \epsilon, \beta] = \sum_{A', B' \subseteq [n]:\ h(A,A') = 1,\ h(B,B') = 1} p_{A',B'}. \]

Similarly we find

\begin{align*}
\Pr[\Pi_A : \beta, \gamma;\ \Pi_B : \alpha, \gamma] &= \sum_{A', B' \subseteq [n]:\ h(A,A') = -1,\ h(B,B') = -1} p_{A',B'}, \\
\Pr[\Pi_A : \epsilon, \alpha;\ \Pi_B : \alpha, \gamma] &= \sum_{A', B' \subseteq [n]:\ h(A,A') = 1,\ h(B,B') = -1} p_{A',B'}, \\
\Pr[\Pi_A : \beta, \gamma;\ \Pi_B : \epsilon, \beta] &= \sum_{A', B' \subseteq [n]:\ h(A,A') = -1,\ h(B,B') = 1} p_{A',B'}.
\end{align*}

Thus from equation (12), combining these probabilities we find

\[ [\mathrm{Exp}(H_n Q_T H_n)]_{A,B} = \sum_{A', B' \subseteq [n]} h(A, A')\, h(B, B')\, p_{A',B'}, \tag{13} \]

giving

\[ \mathrm{Exp}(H_n Q_T H_n) = H_n P_T H_n, \tag{14} \]

from which equations (8) and (9) follow.

□
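Identity (14) can be tested numerically on a small tree by computing the site-pattern probabilities by brute force. The sketch below rests on several assumptions that are not in the text: it uses a four-taxon tree with edges $e_1, e_2, e_3, e_{13}, e_{123}$ and two internal vertices `u` (adjacent to leaves 0 and 2) and `w` (adjacent to leaves 1 and 3), arbitrary edge lengths, states encoded as 0 to 3 so that XOR gives the substitution type, and the Section 3 conventions ($q_\alpha$ in the leading column of $Q_T$, $q_\beta$ in the leading row, and patterns indexed by $A = S_\alpha \cup S_\gamma$, $B = S_\beta \cup S_\gamma$).

```python
import itertools
import numpy as np

H1 = np.array([[1.0, 1.0], [1.0, -1.0]])


def edge_probabilities(qa, qb, qg):
    """[p(e), p(alpha), p(beta), p(gamma)] for one edge, from equation (2)."""
    Q = np.array([[-(qa + qb + qg), qa], [qb, qg]])
    return ((H1 / 2) @ np.exp(H1 @ Q @ H1) @ (H1 / 2)).ravel()


# Hypothetical (q(alpha), q(beta), q(gamma)) per edge e_A, keyed by bitmask of A.
EDGES = {
    0b001: (0.10, 0.04, 0.02),  # e_1   joins w to leaf 1
    0b010: (0.08, 0.03, 0.01),  # e_2   joins u to leaf 2
    0b100: (0.12, 0.05, 0.02),  # e_3   joins w to leaf 3
    0b101: (0.03, 0.01, 0.01),  # e_13  joins u to w
    0b111: (0.09, 0.02, 0.03),  # e_123 joins leaf 0 to u
}
n = 3
N = 2**n
p = {a: edge_probabilities(*q) for a, q in EDGES.items()}

# Edge-length spectrum Q_T.
Q_T = np.zeros((N, N))
for a, (qa, qb, qg) in EDGES.items():
    Q_T[a, 0], Q_T[0, a], Q_T[a, a] = qa, qb, qg
Q_T[0, 0] = -sum(sum(q) for q in EDGES.values())

# Sequence probability spectrum P_T: leaf 0 has state 0, so each leaf state x_i
# is already its difference from leaf 0; sum over internal states yu, yw.
P_T = np.zeros((N, N))
for x1, x2, x3, yu, yw in itertools.product(range(4), repeat=5):
    prob = (p[0b111][yu] * p[0b010][yu ^ x2] * p[0b101][yu ^ yw]
            * p[0b001][yw ^ x1] * p[0b100][yw ^ x3])
    a = sum(1 << i for i, d in enumerate((x1, x2, x3)) if d & 1)
    b = sum(1 << i for i, d in enumerate((x1, x2, x3)) if d & 2)
    P_T[a, b] += prob

# Sylvester matrix H_n.
H = np.array([[1.0]])
for _ in range(n):
    H = np.kron(H1, H)

lhs = H @ P_T @ H
rhs = np.exp(H @ Q_T @ H)
Q_back = (H / N) @ np.log(lhs) @ (H / N)  # equation (9)
```

For these values `lhs` and `rhs` agree to rounding error, and `Q_back` reproduces `Q_T`, which illustrates both directions of Theorem 7 for this tree.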